benchmark evaluation
Preliminary suggestions for rigorous GPAI model evaluations
Paskov, Patricia; Byun, Michael J.; Wei, Kevin; Webster, Toby
This document presents a preliminary compilation of general-purpose AI (GPAI) evaluation practices that may promote internal validity, external validity and reproducibility. It includes suggestions for human uplift studies and benchmark evaluations, as well as cross-cutting suggestions that may apply to many different evaluation types. Suggestions are organised across four stages in the evaluation life cycle: design, implementation, execution and documentation. Drawing from established practices in machine learning, statistics, psychology, economics, biology and other fields recognised to have important lessons for AI evaluation, these suggestions seek to contribute to the conversation on the nascent and evolving field of the science of GPAI evaluations. The intended audience of this document includes providers of GPAI models presenting systemic risk (GPAISR), for whom the EU AI Act lays out specific evaluation requirements; third-party evaluators; policymakers assessing the rigour of evaluations; and academic researchers developing or conducting GPAI evaluations.
In-the-loop Hyper-Parameter Optimization for LLM-Based Automated Design of Heuristics
van Stein, Niki; Vermetten, Diederick; Bäck, Thomas
Large Language Models (LLMs) have shown great potential in automatically generating and optimizing (meta)heuristics, making them valuable tools in heuristic optimization tasks. However, LLMs are generally inefficient at fine-tuning the hyper-parameters of the generated algorithms, often requiring excessive queries that lead to high computational and financial costs. This paper presents a novel hybrid approach, LLaMEA-HPO, which integrates the open-source LLaMEA (Large Language Model Evolutionary Algorithm) framework with a Hyper-Parameter Optimization (HPO) procedure in the loop. By offloading hyper-parameter tuning to an HPO procedure, the LLaMEA-HPO framework allows the LLM to focus on generating novel algorithmic structures, reducing the number of required LLM queries and improving the overall efficiency of the optimization process. We empirically validate the proposed hybrid framework on benchmark problems, including Online Bin Packing, Black-Box Optimization, and the Traveling Salesperson Problem. Our results demonstrate that LLaMEA-HPO achieves performance superior or comparable to that of existing LLM-driven frameworks while significantly reducing computational costs. This work highlights the importance of separating algorithmic innovation and structural code search from parameter tuning in LLM-driven code optimization and offers a scalable approach to improve the efficiency and effectiveness of LLM-based code generation.
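The separation described in this abstract, where one component proposes algorithmic structures and a separate HPO procedure tunes their hyper-parameters, can be illustrated with a toy sketch. This is not the LLaMEA-HPO implementation: the function `propose_structures` is a hypothetical stand-in for the LLM, plain random search stands in for the HPO procedure, and the one-dimensional objective is invented for illustration only.

```python
import random


def propose_structures():
    """Hypothetical stand-in for the LLM: returns candidate algorithm
    structures, each with the search space of its tunable hyper-parameters."""

    def fixed_step(x, p, rng):
        # Uniform perturbation with a tunable maximum step size.
        return x + rng.uniform(-p["step"], p["step"])

    def gaussian_step(x, p, rng):
        # Gaussian perturbation with a tunable standard deviation.
        return x + rng.gauss(0.0, p["sigma"])

    return [
        ("fixed_step", fixed_step, {"step": (0.01, 2.0)}),
        ("gaussian_step", gaussian_step, {"sigma": (0.01, 2.0)}),
    ]


def objective(x):
    """Toy objective (an assumption for this sketch): minimum at x = 3."""
    return (x - 3.0) ** 2


def run_heuristic(step_fn, params, seed, budget=200):
    """Run a simple hill climber built from one structure and one setting."""
    rng = random.Random(seed)
    best_x = 0.0
    best_f = objective(best_x)
    for _ in range(budget):
        candidate = step_fn(best_x, params, rng)
        f = objective(candidate)
        if f < best_f:
            best_x, best_f = candidate, f
    return best_f


def tune(step_fn, space, trials=20, seed=0):
    """Random-search HPO: no calls to the structure generator are needed."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(trials):
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in space.items()}
        score = run_heuristic(step_fn, params, seed=seed)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def search(seed=0):
    """Outer loop: one generator call per structure, tuning done by HPO."""
    results = []
    for name, step_fn, space in propose_structures():
        params, score = tune(step_fn, space, seed=seed)
        results.append((score, name, params))
    return min(results, key=lambda r: r[0])
```

Calling `search()` returns the best `(score, name, params)` triple. The point of the sketch is the division of labour: the generator is queried once per structure, while every hyper-parameter trial is handled by the cheaper tuning loop.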
How to Evaluate Different Machine Learning Deployment Solutions
The emergence of Big Data in decision-making to achieve strategic business objectives has led to machine learning (ML) becoming a key enabler for driving growth, achieving operational excellence, and bringing innovative products to market. This shift has come about as the primary obstacles for ML are being overcome: data engineering at scale and model development are no longer daunting to enterprises given the many efficient and simple solutions provided by cloud or third-party vendors. As a result, ML went from something only bleeding-edge innovators (such as Netflix and Amazon) were doing to a strategic enabler for organizations in the "early majority" stage of adoption. However, enterprises soon find that building a machine learning model isn't the end of the road but just the beginning of a new set of challenges. Because this is all so new, most enterprises do not have a pre-defined set of criteria for evaluating the different solutions for operationalizing ML models. As a result, they are not sure which attributes will allow their AI-enabled products and operations to scale in the long term as they add more models, use more data, or build more complex models.